In many data-driven discriminative tasks, without specific prior knowledge, scientists often start with popular machine-learning classifiers. If one works, bingo! A happy ending: the algorithm can be deployed into production and the research gets published. Otherwise, they continue to try other, more sophisticated models. If all major models fail, we have to turn to other detection modalities, which usually means the modality that generated the current dataset does not capture the information needed for the task.
A fundamental question lies beneath this trial-and-error practice (trial and error is a basic method of problem solving, characterized by repeated, varied attempts that continue until success, or until the practitioner stops trying):
Does the dataset exhibit statistical differences between the groups/classes? Or equivalently: are the samples drawn from different distributions (generating processes)?
This reveals an implicit and often neglected pre-assessment step in the entire pipeline: dataset classifiability analysis.
The Bayes error rate (BER) is the lowest possible test error rate in classification, achieved by the Bayes classifier. It is analogous to the irreducible error rate.
Because of inherent noise (the generating process is stochastic), even the oracle prediction model built on the true distribution p(x, y) incurs this error; that is the Bayes error.
To calculate the BER, we can use a Gaussian Bayes model, based on these naive assumptions:
The features (in this case, the PCs returned by PCA in the latent space) follow multivariate Gaussian distributions.
Each class corresponds to one Gaussian distribution.
The Bayes-optimal decision boundary corresponds to the points where the two class densities are equal.
According to the central limit theorem (CLT), if a statistic is a sum or average of repeated measurements, it will be approximately normal under certain technical conditions, regardless of the distribution of the individual samples. Each wavenumber in Raman (or time-of-flight bin in MS) arises, by the physical process, as an accumulation of measurements of photons/ions/particles. The features are uncorrelated at the micro level: in the physical process, photons/ions/particles of different frequency/energy/mass/electric charge do not interfere with each other. The co-occurrence/correlation between features is a macro-level relation, i.e., peak patterns.
Optional: check feature normality, e.g., by Q-Q plot
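As a quick numerical companion to the Q-Q plot, scipy's probplot returns the correlation between the ordered sample and the theoretical normal quantiles; the feature column below is a synthetic stand-in for illustration.

```python
# Sketch: assess a feature's normality via scipy's probplot (the same data
# a Q-Q plot visualizes); x is a synthetic stand-in for one feature column.
import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(loc=5, scale=2, size=200)
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist='norm')
# r close to 1 means the ordered sample tracks the theoretical quantiles well
print(round(r, 3))
```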
To calculate BER, we can either use:
We need to integrate over the density that is not the highest at each point. As there are two densities, we sum the two integrals. A numerical integration package such as scipy.integrate can be used.
use Gaussian NB classifier, e.g. sklearn.naive_bayes.GaussianNB
predict_proba() returns the probability of the samples for each class in the model
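Both routes can be sketched on a toy 1-D two-Gaussian problem. The means 0 and 3 (std 1) are illustrative assumptions; with equal priors the densities cross at x = 1.5, which is the Bayes decision boundary.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm
from sklearn.naive_bayes import GaussianNB

p0, p1 = norm(0, 1), norm(3, 1)   # the two class densities

# (a) numerical integration: at each point, integrate the density that is
# NOT the highest there; the two pieces meet at the boundary x = 1.5
tail0, _ = integrate.quad(p1.pdf, -np.inf, 1.5)   # p1's mass on p0's side
tail1, _ = integrate.quad(p0.pdf, 1.5, np.inf)    # p0's mass on p1's side
ber_int = 0.5 * (tail0 + tail1)                   # equal priors

# (b) GaussianNB: average posterior mass of the non-predicted class
rng = np.random.default_rng(0)
X = np.r_[rng.normal(0, 1, 5000), rng.normal(3, 1, 5000)].reshape(-1, 1)
y = np.r_[np.zeros(5000), np.ones(5000)]
proba = GaussianNB().fit(X, y).predict_proba(X)   # per-class probabilities
ber_nb = float(np.mean(1 - proba.max(axis=1)))

print(round(ber_int, 4), round(ber_nb, 4))   # both close to 0.0668
```

Both estimates approach Phi(-1.5), the theoretical BER for two unit-variance Gaussians 3 std apart.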
Choose SVM (support vector machine) as the base model and use K-fold CV (cross-validation) to pick the best model.
CV ensures the classifier has proper generalization capability (neither underfitting nor overfitting).
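A minimal sketch of this step with sklearn's GridSearchCV on synthetic data; the candidate C grid and dataset parameters are assumptions for illustration.

```python
# Sketch: pick the best SVM via 5-fold cross-validation.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
grid = GridSearchCV(SVC(kernel='linear'),
                    param_grid={'C': [0.01, 0.1, 1, 10]},  # assumed grid
                    cv=5)                                  # 5-fold CV
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```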
Information gain is used in decision trees. For a specific feature, information gain (IG) measures how much “information” the feature gives us about the class.
$IG(Y|X) = H(Y) - H(Y|X) $
In information theory, IG answers "if we transmit Y, how many bits can be saved if both sender and receiver know X?" Or "how much information of Y is implied in X?"
Attribute/feature X with a high IG is a good split on Y.
It can be proven that information gain equals mutual information (MI).
We can use sklearn.feature_selection.mutual_info_classif to calculate information gain.
Be cautious about the discrete_features parameter: {‘auto’, bool, array_like}, default ‘auto’
If bool, then determines whether to consider all features discrete or continuous. If array, then it should be either a boolean mask with shape (n_features,) or array with indices of discrete features. If ‘auto’, it is assigned to False for dense X and to True for sparse X.
For continuous features, use discrete_features = False.
Notes
The term “discrete features” is used instead of naming them “categorical”, because it describes the essence more accurately. For example, pixel intensities of an image are discrete features (but hardly categorical) and you will get better results if mark them as such. Also note, that treating a continuous variable as discrete and vice versa will usually give incorrect results, so be attentive about that.
True mutual information can’t be negative. If its estimate turns out to be negative, it is replaced by zero.
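A minimal sketch on synthetic continuous features (the make_classification parameters are assumptions for illustration):

```python
# Sketch: per-feature info gain via mutual_info_classif; the features are
# continuous, so discrete_features=False.
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# shuffle=False keeps the two informative features in the first columns
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
ig = mutual_info_classif(X, y, discrete_features=False, random_state=0)
print(ig)   # one non-negative IG estimate per feature
```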
import cla.metrics
X, y = cla.metrics.mvg(
    nobs = 100, # number of observations / samples
    md = 0      # distance between means with respect to std, i.e. (mu2 - mu1) / std, or how many stds the means differ by
)
rpy2 3.X may not support Windows. ECoL metrics may not be available. cannot load library 'C:\Program Files\R\R-4.3.1\bin\bin\x64\R.dll': error 0x7e
import matplotlib
matplotlib.rcParams.update({'font.size': 10})
from sklearn.decomposition import PCA
X_pca = PCA(n_components = 2).fit_transform(X)
cla.metrics.plotComponents2D(X_pca, y)
X, y = cla.metrics.mvg(
    nobs = 100, # number of observations / samples
    md = 3      # distance between means with respect to std, i.e. (mu2 - mu1) / std, or how many stds the means differ by
)
from sklearn.decomposition import PCA
X_pca = PCA(n_components = 2).fit_transform(X)
cla.metrics.plotComponents2D(X_pca, y)
The above generates two datasets with between-class distances (md) of 0 and 3 std. The PCA visualizations match expectation: the 3 std dataset is clearly more separable than the 0 std one.
Use sklearn.naive_bayes.GaussianNB
predict_proba() returns the probability of the samples for each class in the model
import cla.metrics
import matplotlib.pyplot as plt
bs = []
for NS in range(1, 15):
    b, _ = cla.metrics.BER(X, y, NSigma = NS, save_fig = '')
    bs.append(b)
plt.plot(range(1, 15), bs)
plt.ylabel('BER')
plt.xlabel('d')
BER decreases as the between-class distance grows, and stabilizes at around 12 std.
bs = []
nobs = [10, 20, 50, 100, 500, 1000, 2000, 5000, 10000]
for m in nobs:
    b, _ = cla.metrics.BER(X, y, NSigma = 2, nobs = int(m), save_fig = '')
    bs.append(b)
plt.plot(nobs, bs)
plt.xlabel('sample size / nobs(number of observations)')
plt.ylabel('BER')
At a fixed between-class distance, the BER estimate is also influenced by the sample size (nobs).
Use CV to pick the best classifier. It returns the accuracy and the decision-boundary vertices.
dct, *_ = cla.metrics.CLF(X,y, show = True)
dct['classification.ACC']
0.925
We prefer Info Gain over Correlation, because:
Correlation only measures the linear relationship (Pearson's correlation) or monotonic relationship (Spearman's correlation) between two variables.
Mutual information is more general: it measures the reduction of uncertainty in Y after observing X. It is the KL divergence between the joint density and the product of the marginal densities, so MI can capture non-monotonic and other more complicated relationships.
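A quick illustration of this difference on synthetic data: with y = x², the dependence is deterministic but non-monotonic, so Pearson's r is near zero while MI is clearly positive.

```python
# Sketch: a symmetric, non-monotonic dependence that Pearson's r misses
# almost entirely, while mutual information detects it clearly.
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = x ** 2                      # deterministic but non-monotonic

r, _ = pearsonr(x, y)
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
print(round(r, 3), round(mi, 3))   # |r| near 0, MI well above 0
```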
mi,_ = cla.metrics.IG(X, y, show = True)
Synthesis Strategy I (meta-learner):
$D = w_0 + w_1 \times BER + w_2 \times ACC + w_3 \times IG + \dots$
Synthesis Strategy II (decomposition):
$PC1 = w_0 + loading_1 \times BER + loading_2 \times ACC + loading_3 \times IG + \dots$
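Both strategies boil down to a weighted combination of the standardized atom metrics. A sketch on a hypothetical metric matrix M (rows = datasets generated at varying between-class distance d, columns = atom metrics; all numbers here are synthetic stand-ins, not the real cla metrics):

```python
# Sketch of the two synthesis strategies on a hypothetical metric matrix M.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
d = np.linspace(0, 3, 16)                     # between-class distances
# 4 fake atom metrics, each roughly proportional to d plus noise
M = np.outer(d, rng.uniform(0.5, 2, 4)) + rng.normal(0, 0.1, (16, 4))
M_std = StandardScaler().fit_transform(M)     # standardize the metrics

# Strategy I: logistic meta-learner separating "hard" (d < 1.5) from "easy"
meta = LogisticRegression().fit(M_std, (d >= 1.5).astype(int))
D = meta.decision_function(M_std)             # w_0 + w_1*m_1 + ... per dataset

# Strategy II: PC1 of the standardized metric matrix as the unified score
pc1 = PCA(n_components=1).fit_transform(M_std).ravel()
print(D.shape, pc1.shape)                     # one unified score per dataset
```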
Raman spectra of Gujing Tribute Liquor of different ages
Price is roughly proportional to age. Taking 500 ml bottles as an example: 5-year 200 RMB, 8-year 300 RMB, 16-year 600 RMB, 26-year 1700 RMB.
With a long history, Gujing Tribute Liquor with fragrant taste is one of the eight most famous liquors in China. In 196AD, Cao Cao presented the "Jiuyun Spring Liquor" that was produced in his hometown as the royal liquor, as well as its brewing methods to the Emperor Xian of Han Dynasty. During the Wan Li Reign of Ming Dynasty, it was presented to the royal court as a "tribute" all the way until Qing Dynasty, hence the liquor is named "Gujing Tribute Liquor" . On the basis of traditional processes, it has scientific recipes and technological innovations. It features "crystal clear, sweet and mellow like orchid, velvety and lasting after tasting" and brings a unique taste known for its sweetness, aroma and full flavor. It was awarded the gold medal of the national liquor-tasting conference for four times, and won the title of "National Famous Liquor". In March 2003, it was incorporated into the system for protecting original products. In 2005, it became national geographical iconic products, gained wide acclaim and has been popular both at home and abroad. (source: http://english.bozhou.gov.cn/content/33.html)
Additional preprocessing steps (optional)
Highly recommended for high-dimensional physicochemical spectroscopic data, e.g., Raman and MALDI-TOF.
Prior knowledge: Raman spectral data contain only additive or linear structures (each chemical bond or particle corresponds to several wavenumbers), not complex embedded structures. Therefore, non-linear dimensionality-reduction methods such as kernel PCA, LLE, and t-SNE are not suitable.
Coefficients from LASSO or ElasticNet depend on the magnitude of each variable. It is therefore necessary to rescale, or standardize, the variables.
Centering the variables means there is no longer an intercept.
Without feature scaling, the feature selection result can be quite different!
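A sketch of this pitfall with artificial feature scales (the scales, coefficients, and alpha are illustrative assumptions, not the values used elsewhere in this notebook):

```python
# Sketch: LASSO with vs. without standardization. The three features have
# wildly different magnitudes but equal true effect sizes on y.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * [1.0, 100.0, 0.01]    # different scales
y = X @ [1.0, 0.01, 100.0] + rng.normal(0, 0.1, 100)  # equal contributions

raw = Lasso(alpha=0.1).fit(X, y).coef_
std = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X), y).coef_
# without scaling, the tiny-magnitude feature is zeroed out despite its
# large true effect; after scaling, all three survive with similar weights
print(raw, std)
```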
from qsi import io
X,y,X_names,_,y_names = io.load_dataset("vintage_526", display = False)
X, X_names = io.pre.x_binning(X, X_names, target_dim=0.1, flavor='max') # flavor = 'sum'
print('X.shape after preprocessing = ', X.shape)
load dataset from 7344_Y5Y26.csv
X.shape (121, 2089) y.shape (121,)
7344_Y5Y26.csv - Raman spectroscopic profiling dataset of 5-year and 26-year Gujing Tribute vintage liquors.
y = 0: 5-year; y = 1: 26-year
Each sample has 2088 Raman wavenumbers, ranging from 251 to 2338 cm-1. Three outlier samples were removed.
--------------------
If you use this data set, please add the reference:
[1] A unified classifiability analysis framework based on meta-learner and its application in spectroscopic profiling data [J]. Applied Intelligence, 2021, doi: 10.1007/s10489-021-02810-8
X.shape after preprocessing = (121, 208)
from cla.unify import calculate_atom_metrics
import numpy as np
dic = calculate_atom_metrics(mu = X.mean(axis = 0), s = X.std(axis = 0),
                             mds = np.linspace(0, 3, 4+3*4),
                             # repeat = 5, nobs = 100,
                             show_curve = True, show_html = True)
R[write to console]: Loading required package: ECoL
100%|██████████| 48/48 [30:56<00:00, 31.62s/it]
visualize_dict()
generate_html_for_dict()
| d | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 | 1.2 | 1.4 | 1.6 | 1.8 | 2.0 | 2.2 | 2.4 | 2.6 | 2.8 | 3.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| classification.ACC | 0.932 | 0.975 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.Kappa | 0.863 | 0.95 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.F1_Score | 0.93 | 0.975 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.Jaccard | 0.883 | 0.952 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.Precision | 0.94 | 0.968 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.Recall | 0.92 | 0.983 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.McNemar | 0.476 | 0.396 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| classification.McNemar.CHI2 | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf |
| classification.CochranQ | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| classification.CochranQ.T | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| classification.CrossEntropy | 0.235 | 0.169 | 0.021 | 0.009 | 0.014 | 0.041 | 0.058 | 0.065 | 0.051 | 0.042 | 0.036 | 0.029 | 0.026 | 0.023 | 0.02 | 0.018 |
| classification.Mean_KLD | 0.235 | 0.169 | 0.021 | 0.009 | 0.014 | 0.041 | 0.058 | 0.065 | 0.051 | 0.042 | 0.036 | 0.029 | 0.026 | 0.023 | 0.02 | 0.018 |
| classification.AP | 0.942 | 0.995 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.Brier | 0.071 | 0.042 | 0.001 | 0.0 | 0.001 | 0.005 | 0.006 | 0.005 | 0.003 | 0.002 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0.0 |
| classification.ROC_AUC | 0.948 | 0.995 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.PR_AUC | 0.94 | 0.995 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.BER | 0.007 | 0.006 | 0.007 | 0.006 | 0.005 | 0.004 | 0.004 | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.002 | 0.002 | 0.002 | 0.002 |
| classification.SVM.Margin | 5.359 | 12.935 | 27.626 | 47.547 | 74.555 | 100.34 | 129.178 | 174.038 | 225.507 | 244.66 | 307.292 | 325.335 | 379.776 | 420.714 | 484.931 | 530.5 |
| correlation.IG.max | 0.097 | 0.108 | 0.131 | 0.167 | 0.198 | 0.234 | 0.271 | 0.326 | 0.389 | 0.433 | 0.47 | 0.508 | 0.562 | 0.58 | 0.614 | 0.642 |
| correlation.r.max | 0.192 | 0.282 | 0.364 | 0.455 | 0.515 | 0.584 | 0.625 | 0.68 | 0.73 | 0.75 | 0.787 | 0.808 | 0.837 | 0.851 | 0.868 | 0.881 |
| correlation.r.p.min | 0.007 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| correlation.rho.max | 0.193 | 0.278 | 0.362 | 0.463 | 0.525 | 0.599 | 0.642 | 0.702 | 0.757 | 0.779 | 0.802 | 0.827 | 0.844 | 0.85 | 0.859 | 0.862 |
| correlation.rho.p.min | 0.008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| correlation.tau.max | 0.158 | 0.228 | 0.296 | 0.379 | 0.43 | 0.491 | 0.526 | 0.575 | 0.62 | 0.637 | 0.656 | 0.677 | 0.691 | 0.696 | 0.703 | 0.705 |
| correlation.tau.p.min | 0.008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.ES.max | 0.39 | 0.586 | 0.779 | 1.016 | 1.195 | 1.431 | 1.592 | 1.847 | 2.129 | 2.254 | 2.543 | 2.733 | 3.045 | 3.234 | 3.474 | 3.717 |
| test.student.min | 0.007 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.student.min.log10 | -2.199 | -4.308 | -6.97 | -10.888 | -14.154 | -18.895 | -22.309 | -27.699 | -33.915 | -36.577 | -42.675 | -46.381 | -52.8 | -56.457 | -61.002 | -65.474 |
| test.student.T.max | 2.647 | 1.274 | 0.257 | 0.074 | -0.008 | -1.107 | 0.801 | -0.713 | 0.127 | -0.798 | 0.049 | 0.569 | 0.541 | 0.122 | -0.015 | -0.291 |
| test.ANOVA.min | 0.007 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.ANOVA.min.log10 | -2.2 | -4.309 | -6.971 | -10.888 | -14.205 | -18.942 | -22.309 | -27.805 | -33.915 | -36.625 | -42.773 | -46.711 | -53.02 | -56.724 | -61.294 | -65.741 |
| test.ANOVA.F.max | 7.627 | 17.261 | 30.495 | 51.75 | 71.406 | 102.472 | 126.72 | 170.709 | 227.302 | 254.138 | 323.477 | 373.73 | 464.177 | 523.2 | 603.549 | 691.646 |
| test.MANOVA | 0.329 | 0.244 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.MANOVA.log10 | -0.808 | -0.764 | -3.932 | -12.044 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 |
| test.MANOVA.F | 1.892 | 1.781 | 9.492 | 17.409 | 32.705 | 48.02 | 72.471 | 89.9 | 143.84 | 141.064 | 198.101 | 267.998 | 283.932 | 378.748 | 425.698 | 412.287 |
| test.MWW.min | 0.008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.MWW.min.log10 | -2.197 | -4.084 | -6.504 | -10.179 | -12.879 | -16.556 | -18.87 | -22.411 | -25.922 | -27.347 | -28.94 | -30.704 | -31.987 | -32.372 | -33.033 | -33.266 |
| test.MWW.U.min | 4014.0 | 3392.667 | 2910.0 | 2328.0 | 1969.667 | 1539.667 | 1293.333 | 945.333 | 628.0 | 503.667 | 370.0 | 226.333 | 124.667 | 94.333 | 43.0 | 25.0 |
| test.KS.min | 0.008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.KS.min.log10 | -2.149 | -4.121 | -5.76 | -9.501 | -11.811 | -15.967 | -17.366 | -23.805 | -28.144 | -31.137 | -35.083 | -38.335 | -41.508 | -45.376 | -49.252 | -52.58 |
| test.KS.D.max | 0.237 | 0.317 | 0.37 | 0.467 | 0.517 | 0.593 | 0.617 | 0.71 | 0.763 | 0.797 | 0.837 | 0.867 | 0.893 | 0.923 | 0.95 | 0.97 |
| test.CHISQ.min | 0.474 | 0.268 | 0.15 | 0.078 | 0.029 | 0.016 | 0.008 | 0.003 | 0.001 | 0.001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.CHISQ.min.log10 | -0.324 | -0.575 | -0.835 | -1.111 | -1.535 | -1.805 | -2.086 | -2.586 | -2.862 | -3.219 | -3.573 | -3.931 | -4.278 | -4.67 | -5.15 | -5.219 |
| test.CHISQ.CHI2.max | 0.513 | 1.238 | 2.115 | 3.121 | 4.756 | 5.84 | 6.988 | 9.076 | 10.243 | 11.768 | 13.288 | 14.837 | 16.347 | 18.062 | 20.174 | 20.475 |
| test.KW.min | 0.008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.KW.min.log10 | -2.198 | -4.086 | -6.507 | -10.182 | -12.883 | -16.561 | -18.875 | -22.416 | -25.928 | -27.353 | -28.946 | -30.711 | -31.993 | -32.379 | -33.039 | -33.273 |
| test.KW.H.max | 7.462 | 15.522 | 26.187 | 42.643 | 54.839 | 71.518 | 82.042 | 98.173 | 114.196 | 120.702 | 127.983 | 136.049 | 141.914 | 143.677 | 146.698 | 147.765 |
| test.Median.min | 0.0 | 0.0 | 0.007 | 0.12 | 1.353 | 1.3 | 1.993 | 0.067 | 0.167 | 0.873 | 0.867 | 0.873 | 0.66 | 0.127 | 1.033 | 0.34 |
| test.Median.min.log10 | -inf | -inf | -inf | -inf | -0.369 | 0.025 | 0.056 | -inf | -inf | -0.687 | -inf | -0.687 | -0.427 | -1.063 | -0.033 | -1.136 |
| test.Median.CH2.max | 1.0 | 1.0 | 0.963 | 0.781 | 0.403 | 0.307 | 0.286 | 0.853 | 0.827 | 0.56 | 0.508 | 0.56 | 0.515 | 0.743 | 0.335 | 0.699 |
| overlapping.F1.mean | 0.058 | 0.105 | 0.192 | 0.286 | 0.368 | 0.445 | 0.515 | 0.584 | 0.638 | 0.689 | 0.724 | 0.76 | 0.785 | 0.805 | 0.821 | 0.832 |
| overlapping.F1.sd | 0.042 | 0.061 | 0.068 | 0.07 | 0.068 | 0.063 | 0.059 | 0.06 | 0.061 | 0.058 | 0.058 | 0.056 | 0.056 | 0.057 | 0.055 | 0.054 |
| overlapping.F1v.mean | 0.713 | 0.7 | 0.693 | 0.692 | 0.696 | 0.709 | 0.731 | 0.746 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 |
| overlapping.F1v.sd | 0.013 | 0.009 | 0.005 | 0.006 | 0.012 | 0.022 | 0.022 | 0.012 | 0.004 | 0.003 | 0.003 | 0.002 | 0.002 | 0.003 | 0.003 | 0.003 |
| overlapping.F2.mean | 1.0 | 0.995 | 0.955 | 0.342 | 0.067 | 0.043 | 0.168 | 0.128 | 0.172 | 0.208 | 0.128 | 0.163 | 0.152 | 0.175 | 0.32 | 0.282 |
| overlapping.F2.sd | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| overlapping.F3.mean | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| overlapping.F3.sd | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| overlapping.F4.mean | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| overlapping.F4.sd | 0.001 | 0.001 | 0.001 | 0.001 | 0.002 | 0.001 | 0.001 | 0.0 | 0.001 | 0.0 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| neighborhood.N1 | 0.009 | 0.02 | 0.01 | 0.013 | 0.025 | 0.016 | 0.02 | 0.001 | 0.011 | 0.006 | 0.01 | 0.007 | 0.012 | 0.012 | 0.015 | 0.014 |
| neighborhood.N2.mean | 0.504 | 0.389 | 0.141 | 0.008 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 |
| neighborhood.N2.sd | 0.501 | 0.487 | 0.348 | 0.09 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 |
| neighborhood.N3.mean | 0.207 | 0.207 | 0.203 | 0.199 | 0.194 | 0.19 | 0.183 | 0.178 | 0.172 | 0.168 | 0.162 | 0.158 | 0.154 | 0.152 | 0.146 | 0.143 |
| neighborhood.N3.sd | 0.011 | 0.011 | 0.011 | 0.01 | 0.01 | 0.01 | 0.01 | 0.011 | 0.012 | 0.012 | 0.013 | 0.014 | 0.015 | 0.016 | 0.017 | 0.019 |
| neighborhood.N4.mean | 0.508 | 0.387 | 0.132 | 0.003 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| neighborhood.N4.sd | 0.501 | 0.488 | 0.339 | 0.047 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| neighborhood.T1.mean | 0.002 | 0.002 | 0.002 | 0.003 | 0.001 | 0.001 | 0.002 | 0.001 | 0.001 | 0.002 | 0.003 | 0.002 | 0.003 | 0.001 | 0.003 | 0.002 |
| neighborhood.T1.sd | 0.028 | 0.033 | 0.022 | 0.035 | 0.015 | 0.008 | 0.033 | 0.012 | 0.007 | 0.032 | 0.047 | 0.027 | 0.04 | 0.009 | 0.049 | 0.025 |
| neighborhood.LSC | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 |
| linearity.L1.mean | 0.157 | 0.153 | 0.157 | 0.145 | 0.142 | 0.128 | 0.123 | 0.11 | 0.1 | 0.087 | 0.085 | 0.078 | 0.07 | 0.067 | 0.06 | 0.057 |
| linearity.L1.sd | 0.151 | 0.147 | 0.151 | 0.139 | 0.136 | 0.123 | 0.119 | 0.106 | 0.096 | 0.083 | 0.082 | 0.075 | 0.067 | 0.064 | 0.058 | 0.054 |
import joblib
dic = joblib.load('vintage_526_atom_metrics.pkl')
Record the computed atom metrics. Calculating the atom metrics takes a long time, so we persist the result to a pickle file and reload it in future sessions.
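A minimal sketch of the persist-and-reload pattern with joblib (the toy dict and file name stand in for the real atom-metrics result):

```python
# Sketch: cache an expensive result with joblib and reload it later.
import os
import tempfile
import joblib

dic = {'d': [0.0, 0.2, 0.4], 'classification.ACC': [0.93, 0.98, 1.0]}  # toy stand-in
path = os.path.join(tempfile.gettempdir(), 'atom_metrics.pkl')
joblib.dump(dic, path)          # persist once after the long computation
restored = joblib.load(path)    # cheap reload in a later session
print(restored == dic)
```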
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import StandardScaler
from cla.unify import train_metalearner_logistic, calculate_unified_metric, filter_metrics
_, keys, _, M = filter_metrics(dic, .8)
# as different metrics may have quite different value ranges, perform a standardization.
scaler = StandardScaler()
M = scaler.fit_transform(M)
model = train_metalearner_logistic(M, dic['d'], cutoff = 2)
before filter
Metrics above the threshold (0.8): ['classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'overlapping.F1.mean' 'neighborhood.N3.mean' 'linearity.L1.mean' 'linearity.L1.sd']
from cla.unify import calculate_unified_metric
calculate_unified_metric(X, y, model, keys, method = 'meta.logistic',scaler=scaler)
Exception ignored from cffi callback <function _consolewrite_ex at 0x00000270097FE310>:
Traceback (most recent call last):
File "C:\Users\eleve\anaconda3\lib\site-packages\rpy2\rinterface_lib\callbacks.py", line 133, in _consolewrite_ex
s = conversion._cchar_to_str_with_maxlen(buf, n, _CCHAR_ENCODING)
File "C:\Users\eleve\anaconda3\lib\site-packages\rpy2\rinterface_lib\conversion.py", line 138, in _cchar_to_str_with_maxlen
s = ffi.string(c, maxlen).decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 0: invalid continuation byte
R[write to console]: In addition:
R[write to console]: There were 50 or more warnings (use warnings() to see the first 50)
R[write to console]:
(0.9985843737437561, [5.907182384306342e-07, 3.234180703209244e-06])
This is an all-in-one example: it reuses the pickled atom metrics and does not apply the filter.
from qsi import io
X,y,X_names,_,y_names = io.load_dataset("vintage_526", display = False)
X, X_names = io.pre.x_binning(X, X_names, target_dim=0.1, flavor='max') # flavor = 'sum'
print('X.shape after preprocessing = ', X.shape)
load dataset from 7344_Y5Y26.csv
X.shape (121, 2089) y.shape (121,)
7344_Y5Y26.csv - Raman spectroscopic profiling dataset of 5-year and 26-year Gujing Tribute vintage liquors.
y = 0: 5-year; y = 1: 26-year
Each sample has 2088 Raman wavenumbers, ranging from 251 to 2338 cm-1. Three outlier samples were removed.
--------------------
If you use this data set, please add the reference:
[1] A unified classifiability analysis framework based on meta-learner and its application in spectroscopic profiling data [J]. Applied Intelligence, 2021, doi: 10.1007/s10489-021-02810-8
X.shape after preprocessing = (121, 208)
from cla.unify import analyze
analyze(X,y, filter_threshold = .8, method = 'meta.logistic', pkl = '20221126231414.365863.pkl')
Unable to determine R home: [WinError 2] The system cannot find the file specified
rpy2 3.X may not support Windows. ECoL metrics may not be available. Load atom metrics from 20221126231414.365863.pkl before filter
Metrics above the threshold (0.5): ['classification.CrossEntropy' 'classification.Mean_KLD' 'classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'overlapping.F1.mean' 'overlapping.F1v.mean' 'neighborhood.N3.mean' 'neighborhood.N3.sd' 'linearity.L1.mean' 'linearity.L1.sd'] Score: 1.0 Coef and Intercept: [[-5.73373270e-07 -5.73373270e-07 7.81742374e-09 1.93720782e-02 2.72407895e-05 3.57117946e-05 3.56104936e-05 2.91484452e-05 1.49638818e-04 -2.79887334e-03 -2.78816641e-03 2.58263316e-02 1.98688303e-02 -1.53056212e-03 -3.23749432e-02 -2.18890719e-03 3.90938098e-05 -2.23973768e-04 8.90254823e-04 -1.53082977e-03 6.84549976e-03 3.60751853e-05 2.60570471e-05 4.37587961e-06 6.39158385e-07 3.28793927e-07 3.16148007e-07]] [3.46445846e-05] KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0).
KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). c = 0 , in-class unified metric = 5.412587631243967e-05 KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). c = 1 , in-class unified metric = 0.001343643563679917
(1.0, [5.412587631243967e-05, 0.001343643563679917], '20221126231414.365863.pkl')
analyze(X,y, use_filter = True, method = 'decompose.pca', pkl = '20221126231414.365863.pkl')
Load atom metrics from 20221126231414.365863.pkl before filter
Metrics above the threshold (0.5): ['classification.CrossEntropy' 'classification.Mean_KLD' 'classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'overlapping.F1.mean' 'overlapping.F1v.mean' 'neighborhood.N3.mean' 'neighborhood.N3.sd' 'linearity.L1.mean' 'linearity.L1.sd']
Explained Variance Ratios for the first three PCs [8.18354850e-01 1.80890834e-01 5.36143576e-04] KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). c = 0 , in-class unified metric = 520.9017735032552 KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). c = 1 , in-class unified metric = 512.5762879171997 before scaling: -2.507408392083686e+306 [520.9017735032552, 512.5762879171997] PC1 range: -1911.3099069259176 3166.1374349557695 after scaling: 1.0 [0.52097747 0.52261717]
(1.0, array([0.52097747, 0.52261717]), '20221126231414.365863.pkl')
Conclusion: the between-class classifiability metric is much larger than the in-class values.
from qsi import io
X,y,X_names,_,y_names = io.load_dataset("salt", x_range=list(range(400,1400)))
X, X_names = io.pre.x_binning(X, X_names, target_dim=0.1, flavor='max') # flavor = 'sum'
print('X.shape after preprocessing = ', X.shape)
load dataset from 7545.csv X.shape (125, 1000) y.shape (125,)
Raman spectra of table salt
["jinzhihaiyan (zhongyan): refined sea salt (China National Salt Group), iodized",
"shenjinkuangyan (zhongyan): deep-well rock salt (China National Salt Group), iodized",
"aozhoutianranhaiyan (huaiyan): Australian natural sea salt (Huaiyan), non-iodized",
"aozhouxueyan (huaiyan): Australian snow salt (Huaiyan), non-iodized"]
We use the 2nd and 3rd classes: 1 - deep-well salt (China), 2 - sea salt (Australia).
Instrument: Laser Raman Spectrometer
Manufacturer: Enwave Optronics, U.S.A.
Model: Prott-ezRaman-d3
Test parameters: laser wavelength 785 nm
laser power 450 mW (max mode)
CCD at -85 °C
integration time 30 s
X.shape after preprocessing = (125, 100)
from cla.unify import analyze
analyze(X,y, use_filter = True, pkl = '20221127191822.217812.pkl', method = 'meta.logistic')
Load atom metrics from 20221127191822.217812.pkl before filter
Metrics above the threshold (0.5): ['classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.student.T.max' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'test.Median.min' 'test.Median.min.log10' 'overlapping.F1.mean' 'overlapping.F1.sd' 'overlapping.F1v.mean' 'neighborhood.N3.mean' 'neighborhood.N3.sd' 'linearity.L1.mean' 'linearity.L1.sd'] Score: 1.0 Coef and Intercept: [[ 1.35800829e-08 1.79080650e-02 2.27082498e-05 2.68941122e-05 2.76123428e-05 2.26016767e-05 1.11499735e-04 -2.06499142e-03 -5.61227832e-04 -2.06887367e-03 1.91052481e-02 1.35331494e-02 -1.19610027e-03 -2.73930910e-02 -1.89001196e-03 3.23139517e-05 -1.80724310e-04 7.22407390e-04 -1.19630776e-03 5.35223776e-03 3.71357695e-03 2.72674190e-05 3.00809899e-08 1.98636069e-05 3.30494068e-06 5.05136354e-07 1.85420103e-06 3.70840205e-06]] [2.64054056e-05] c = 1 , in-class unified metric = 0.0001808874752863014 c = 2 , in-class unified metric = 0.0007227458921617193
(4.341332031428809e-13, [0.0001808874752863014, 0.0007227458921617193], '20221127191822.217812.pkl')
from cla.unify import analyze
analyze(X,y, use_filter = True, pkl = '20221127191822.217812.pkl', method = 'decompose.pca')
Load atom metrics from 20221127191822.217812.pkl before filter
Metrics above the threshold (0.5): ['classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.student.T.max' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'test.Median.min' 'test.Median.min.log10' 'overlapping.F1.mean' 'overlapping.F1.sd' 'overlapping.F1v.mean' 'neighborhood.N3.mean' 'neighborhood.N3.sd' 'linearity.L1.mean' 'linearity.L1.sd']
Explained Variance Ratios for the first three PCs [8.26779438e-01 1.72092940e-01 6.23388247e-04] c = 1 , in-class unified metric = 518.018235293348 c = 2 , in-class unified metric = 529.3232315241899 before scaling: 1078.5798570230465 [518.018235293348, 529.3232315241899] PC1 range: -1952.4088786029863 3117.744458116934 after scaling: 0.4021899271419477 [0.51275101 0.51052129]
(0.4021899271419477, array([0.51275101, 0.51052129]), '20221127191822.217812.pkl')
Compared to the previous dataset, this dataset is much harder to classify.
import sklearn
import scipy
import statsmodels
import seaborn
import numpy
import pandas
import qsi
import clams
print(sklearn.__version__,
      scipy.__version__,
      statsmodels.__version__,
      seaborn.__version__,
      numpy.__version__,
      pandas.__version__,
      qsi.__version__,
      clams.__version__)
1.0.2 1.7.3 0.13.2 0.11.2 1.21.5 1.4.2